In [1]:
import pandas as pd
import numpy as np
import plotly
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

Data Description¶

- VIN (1-10): The first 10 characters of the vehicle’s VIN (Vehicle Identification Number).
- County: The region where the vehicle’s owner is registered. Owners in Washington may have vehicles located in other states.
- City: The city where the owner lives.
- State: The state linked to the vehicle's registration.
- Postal Code: The 5-digit ZIP code of the owner’s address.
- Model Year: The year of the vehicle, decoded from the VIN.
- Make: The manufacturer of the vehicle, decoded from the VIN.
- Model: The vehicle’s model, decoded from the VIN.
- Electric Vehicle Type: Whether the vehicle is fully electric or a plug-in hybrid.
- CAFV Eligibility: Whether the vehicle qualifies as a Clean Alternative Fuel Vehicle (CAFV) based on Washington's 2019 House Bill 2042.
- Electric Range: How far the vehicle can go on electric power alone.
- Base MSRP: The lowest suggested retail price for the vehicle’s model.
- Legislative District: The Washington state legislative area where the owner lives.
- DOL Vehicle ID: A unique ID assigned by the Department of Licensing (DOL).
- Vehicle Location: The center of the vehicle's registered ZIP code.
- Electric Utility: The electric utility serving the owner’s address.
- 2020 Census Tract: The census tract code (state + county + tract) assigned by the U.S. Census Bureau.
In [2]:
df =  pd.read_csv("dataset.csv")
df.head()
Out[2]:
VIN (1-10) County City State Postal Code Model Year Make Model Electric Vehicle Type Clean Alternative Fuel Vehicle (CAFV) Eligibility Electric Range Base MSRP Legislative District DOL Vehicle ID Vehicle Location Electric Utility 2020 Census Tract
0 JTMEB3FV6N Monroe Key West FL 33040 2022 TOYOTA RAV4 PRIME Plug-in Hybrid Electric Vehicle (PHEV) Clean Alternative Fuel Vehicle Eligible 42 0 NaN 198968248 POINT (-81.80023 24.5545) NaN 12087972100
1 1G1RD6E45D Clark Laughlin NV 89029 2013 CHEVROLET VOLT Plug-in Hybrid Electric Vehicle (PHEV) Clean Alternative Fuel Vehicle Eligible 38 0 NaN 5204412 POINT (-114.57245 35.16815) NaN 32003005702
2 JN1AZ0CP8B Yakima Yakima WA 98901 2011 NISSAN LEAF Battery Electric Vehicle (BEV) Clean Alternative Fuel Vehicle Eligible 73 0 15.0 218972519 POINT (-120.50721 46.60448) PACIFICORP 53077001602
3 1G1FW6S08H Skagit Concrete WA 98237 2017 CHEVROLET BOLT EV Battery Electric Vehicle (BEV) Clean Alternative Fuel Vehicle Eligible 238 0 39.0 186750406 POINT (-121.7515 48.53892) PUGET SOUND ENERGY INC 53057951101
4 3FA6P0SU1K Snohomish Everett WA 98201 2019 FORD FUSION Plug-in Hybrid Electric Vehicle (PHEV) Not eligible due to low battery range 26 0 38.0 2006714 POINT (-122.20596 47.97659) PUGET SOUND ENERGY INC 53061041500
In [3]:
df.columns
Out[3]:
Index(['VIN (1-10)', 'County', 'City', 'State', 'Postal Code', 'Model Year',
       'Make', 'Model', 'Electric Vehicle Type',
       'Clean Alternative Fuel Vehicle (CAFV) Eligibility', 'Electric Range',
       'Base MSRP', 'Legislative District', 'DOL Vehicle ID',
       'Vehicle Location', 'Electric Utility', '2020 Census Tract'],
      dtype='object')

Data Cleaning¶

Dropping the null values as they are less than 5% of missing values¶

In [4]:
df.isna().sum()/len(df)
Out[4]:
VIN (1-10)                                           0.000000
County                                               0.000000
City                                                 0.000000
State                                                0.000000
Postal Code                                          0.000000
Model Year                                           0.000000
Make                                                 0.000000
Model                                                0.000178
Electric Vehicle Type                                0.000000
Clean Alternative Fuel Vehicle (CAFV) Eligibility    0.000000
Electric Range                                       0.000000
Base MSRP                                            0.000000
Legislative District                                 0.002539
DOL Vehicle ID                                       0.000000
Vehicle Location                                     0.000213
Electric Utility                                     0.003933
2020 Census Tract                                    0.000000
dtype: float64
In [5]:
df.dropna(inplace=True)

dropping the unwanted columns as they are contain values which are all unique and doesn't give any useful information for analysis¶

In [6]:
df.columns
Out[6]:
Index(['VIN (1-10)', 'County', 'City', 'State', 'Postal Code', 'Model Year',
       'Make', 'Model', 'Electric Vehicle Type',
       'Clean Alternative Fuel Vehicle (CAFV) Eligibility', 'Electric Range',
       'Base MSRP', 'Legislative District', 'DOL Vehicle ID',
       'Vehicle Location', 'Electric Utility', '2020 Census Tract'],
      dtype='object')
In [7]:
df.shape
Out[7]:
(112152, 17)
In [8]:
df.nunique()
Out[8]:
VIN (1-10)                                             7522
County                                                   39
City                                                    435
State                                                     1
Postal Code                                             516
Model Year                                               20
Make                                                     34
Model                                                   114
Electric Vehicle Type                                     2
Clean Alternative Fuel Vehicle (CAFV) Eligibility         3
Electric Range                                          101
Base MSRP                                                30
Legislative District                                     49
DOL Vehicle ID                                       112152
Vehicle Location                                        516
Electric Utility                                         73
2020 Census Tract                                      1760
dtype: int64
In [9]:
import re

def xed(x):
    k,j = re.findall("\s.*",x)[0].strip(" (").strip(")").split(" ")
    return (float(k),float(j))
In [10]:
xed(df["Vehicle Location"][2])
Out[10]:
(-120.50721, 46.60448)
In [11]:
df.columns
Out[11]:
Index(['VIN (1-10)', 'County', 'City', 'State', 'Postal Code', 'Model Year',
       'Make', 'Model', 'Electric Vehicle Type',
       'Clean Alternative Fuel Vehicle (CAFV) Eligibility', 'Electric Range',
       'Base MSRP', 'Legislative District', 'DOL Vehicle ID',
       'Vehicle Location', 'Electric Utility', '2020 Census Tract'],
      dtype='object')
In [12]:
df["coordinates"] = df["Vehicle Location"].apply(lambda x: xed(x))
In [13]:
df.drop(columns=['DOL Vehicle ID','Vehicle Location'],inplace=True)

EDA¶

Univariate¶

In [14]:
df["County"].value_counts().head(10).plot(kind="bar",color="green")
plt.title("The top 10 counties by vechiles registered count")
plt.show()
No description has been provided for this image

> From the above we can see the top 10 counties where majority of people are using electronic vehicles as a transport means. The King county has the largest no.of vehicles registered.¶

In [15]:
df["Make"].value_counts().head(10).plot(kind="bar",color="red")
plt.title("The top 10 vehicle brands by vehicles in use count")
plt.show()
No description has been provided for this image
In [16]:
df["Make"].value_counts(ascending=True).head()
Out[16]:
Make
TH!NK              3
BENTLEY            3
AZURE DYNAMICS     7
GENESIS           18
FISKER            19
Name: count, dtype: int64

> From the above we can observe that Tesla is dominating the electric vehicles industry with such a huge customer base, while some brands like THINK & BENTLY have such a low presence in the market.¶

In [17]:
df["Electric Vehicle Type"].value_counts().plot(kind="pie",color="yellow",autopct="%.2f",shadow=True,explode=[0.1,0])
plt.title("Distribution of Electric Vehicle types in-use")
plt.ylabel("")
plt.show()
No description has been provided for this image

> From the above we can observe that Battery based Electric Vehicles are more in use than Plug-in Hybrid based Electric Vehicles.¶

In [18]:
sns.histplot(df["Model Year"],kde=True,color="blue",bins=20)
plt.title("Distribution of vehicles manufactured year")
plt.show()
No description has been provided for this image

> From the above we can observe that majority of the people are using vehicles that were manufactured in the years: 2021 - 2022¶

In [19]:
sns.displot(df["Electric Range"],kde=True,color="purple")
plt.title("The distribution of the battery range of the vehicles")
plt.show()
No description has been provided for this image
In [20]:
df['Clean Alternative Fuel Vehicle (CAFV) Eligibility'].value_counts().plot(kind="pie",autopct="%.2f",explode=[0.02,0.02,0.01])
plt.title("Distribution of CAFV Eligibility")
plt.ylabel("")
plt.show()
No description has been provided for this image
In [21]:
df.groupby("Clean Alternative Fuel Vehicle (CAFV) Eligibility")["Electric Range"].agg(["min","max","mean","count"])
Out[21]:
min max mean count
Clean Alternative Fuel Vehicle (CAFV) Eligibility
Clean Alternative Fuel Vehicle Eligible 30 337 163.822194 58395
Eligibility unknown as battery range has not been researched 0 0 0.000000 39097
Not eligible due to low battery range 6 29 19.363847 14660

> From the above, we can observe that around 48% of the cars have a battery range 0-30 which might be due to some vehicles not having the eligibility due to low battery percentage and others not having their battery range researched. The reason for that can be because of the vehicle being an old model which were manufactured before the electric vehicles era.¶

In [22]:
df["Model"].value_counts().head(10).plot(kind="barh",color="gold")
plt.title("The top 10 vehicle models by vehicles in-use count")
plt.show()
No description has been provided for this image

> From the above we can observe that MODEL 3, MODEL Y and LEAF are the models with highest demand in the market¶

In [23]:
df["Electric Utility"].value_counts().head(10).plot(kind="barh",color="navy")
plt.title("The top 10 power suppliers for the electric vehicle owners")
plt.show()
No description has been provided for this image

> From the above we can observe the top 10 electric power providers respective to the customers locations and usage.¶

In [24]:
temp = df[df["Base MSRP"]!=0] # removing the values 0 as the vehicles can't be sold for free. Considering that their values might be missing
In [25]:
sns.histplot(data=temp,x="Base MSRP",kde=True)
plt.title("The distribution of Base MSRP of the vehicles")
plt.show()
No description has been provided for this image

> From the above we can say that most of the vehicles have an Base MSRP of ranging between 0 to 100,000.¶


Bivariate¶

In [26]:
df.groupby("Make")["Model"].nunique().sort_values(ascending=False).head(10).plot(kind="barh",color="teal")
plt.title("No.of varaints available in each of the top 10 brands by count")
plt.show()
No description has been provided for this image

> From the above we can observe the top 10 brands which have a large range of variants available/being used in the market.¶

In [27]:
sns.histplot(data=df,x="Model Year",hue="Electric Vehicle Type",bins=20, multiple='stack')
plt.title("No.of vehicles manufactured in each year along with the type of electric vehicle")
plt.show()
No description has been provided for this image

> From the above plot we can observe that there was a cut in the manufacturing of vehicles in the year 2020 which might be due to the covid pandemic post which the production rose back up. The vehicles industry had a bloom in the 2019 and again in 2023.¶

In [28]:
df.groupby("Model")["Electric Range"].mean().sort_values(ascending=False).head(10).plot(kind="bar",color="olive")
Out[28]:
<Axes: xlabel='Model'>
No description has been provided for this image

> From the above we can observe the top 10 vehicle models with a high average electric range.¶

In [29]:
temp = df[df["Electric Range"]!=0] # removing the values 0 as the battery based vehicles can't be having 0 Range. Assuming that their values might be missing.
In [30]:
sns.violinplot(data=temp,x="Electric Range",hue="Electric Vehicle Type")
plt.title("No.of vehicles manufactured in each year along with the type of electric vehicle")
plt.show()
No description has been provided for this image

> From the above we can see that the plug-in type vehicles mostly have a range of 0-50, while the battery type vehicles has a wide distribution from 170 to 300.¶


Create a Choropleth using plotly.express to display the number of EV vehicles based on location.¶

In [31]:
df["County"].unique()
Out[31]:
array(['Yakima', 'Skagit', 'Snohomish', 'Island', 'Thurston', 'Grant',
       'King', 'Kitsap', 'Whitman', 'Spokane', 'Cowlitz', 'Pierce',
       'Kittitas', 'Grays Harbor', 'Clark', 'Chelan', 'Whatcom', 'Benton',
       'Walla Walla', 'Mason', 'San Juan', 'Lewis', 'Jefferson',
       'Clallam', 'Douglas', 'Klickitat', 'Skamania', 'Adams', 'Franklin',
       'Okanogan', 'Stevens', 'Asotin', 'Ferry', 'Pacific', 'Columbia',
       'Wahkiakum', 'Lincoln', 'Pend Oreille', 'Garfield'], dtype=object)
In [32]:
# df['County'] = df['County'] + " County"
temp = df.groupby("County")["VIN (1-10)"].count().reset_index()
temp.columns = ['County','EV Count']
county_fips = {
    'Adams': '53001', 'Asotin': '53003', 'Benton': '53005', 'Chelan': '53007',
    'Clallam': '53009', 'Clark': '53011', 'Columbia': '53013', 'Cowlitz': '53015',
    'Douglas': '53017', 'Ferry': '53019', 'Franklin': '53021', 'Garfield': '53023',
    'Grant': '53025', 'Grays Harbor': '53027', 'Island': '53029', 'Jefferson': '53031',
    'King': '53033', 'Kitsap': '53035', 'Kittitas': '53037', 'Klickitat': '53039',
    'Lewis': '53041', 'Lincoln': '53043', 'Mason': '53045', 'Okanogan': '53047',
    'Pacific': '53049', 'Pend Oreille': '53051', 'Pierce': '53053', 'San Juan': '53055',
    'Skagit': '53057', 'Skamania': '53059', 'Snohomish': '53061', 'Spokane': '53063',
    'Stevens': '53065', 'Thurston': '53067', 'Wahkiakum': '53069', 'Walla Walla': '53071',
    'Whatcom': '53073', 'Whitman': '53075', 'Yakima': '53077'
}
temp['FIPS'] = temp['County'].map(county_fips)
In [33]:
temp
Out[33]:
County EV Count FIPS
0 Adams 34 53001
1 Asotin 48 53003
2 Benton 1376 53005
3 Chelan 654 53007
4 Clallam 728 53009
5 Clark 6681 53011
6 Columbia 13 53013
7 Cowlitz 569 53015
8 Douglas 221 53017
9 Ferry 27 53019
10 Franklin 365 53021
11 Garfield 4 53023
12 Grant 335 53025
13 Grays Harbor 402 53027
14 Island 1298 53029
15 Jefferson 698 53031
16 King 58980 53033
17 Kitsap 3828 53035
18 Kittitas 392 53037
19 Klickitat 175 53039
20 Lewis 431 53041
21 Lincoln 30 53043
22 Mason 547 53045
23 Okanogan 149 53047
24 Pacific 145 53049
25 Pend Oreille 32 53051
26 Pierce 8525 53053
27 San Juan 717 53055
28 Skagit 1228 53057
29 Skamania 139 53059
30 Snohomish 12412 53061
31 Spokane 2785 53063
32 Stevens 91 53065
33 Thurston 4109 53067
34 Wahkiakum 39 53069
35 Walla Walla 312 53071
36 Whatcom 2839 53073
37 Whitman 177 53075
38 Yakima 617 53077
In [34]:
geojson_url = 'https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json'

fig = px.choropleth(
    temp, 
    geojson=geojson_url, 
    locations='FIPS',  # Use FIPS codes for matching
    color='EV Count', 
    scope='usa',  
    title='Number of Electronic vehicles in Washington State County-wise',
    labels={'EV Count': 'EV Count'}
)

fig.update_geos(fitbounds="locations", visible=False)  # Focus on WA state
fig.show()

Create a Racing Bar Plot to display the animation of EV Make and its count each year.¶

In [35]:
import bar_chart_race as bcr
In [36]:
temp = df.groupby(["Make","Model Year"])["VIN (1-10)"].count().reset_index(name='Count')
temp = temp[(temp["Model Year"]>=2012 )&(temp["Model Year"]<=2024)] # taking the data of past 10 years
temp = temp.sort_values(by="Model Year")
In [37]:
fig = px.bar(
    temp,
    x='Count',
    y='Make',
    color='Make',  # Optional: Color bars by Make
    animation_frame='Model Year',  # Use Year for the animation
    orientation='h',  # Horizontal bar plot
    title='EV Make Count Over the Years',
    labels={'Count': 'EV Count', 'Make': 'Vehicle Make'},
    range_x=[0, temp['Count'].max() + 100]  # Adjust range for better visualization
)

fig.update_layout(
    xaxis=dict(showgrid=True, gridcolor='LightGray', tickformat=',d'),
    yaxis_title='Make',
    xaxis_title='Number of Vehicles',
    showlegend=False,
    title_x=0.5,
    margin=dict(l=50, r=50, t=50, b=50),
    width=800,
    height=600
)
fig.show()
In [38]:
ev_counts = temp.pivot_table(index='Model Year', columns='Make', values='Count', fill_value=0)
In [39]:
# Pivot the DataFrame to the right format
ev_counts = temp.pivot_table(index='Model Year', columns='Make', values='Count', fill_value=0)

bar_chart_params = {
    'df': ev_counts,                             # DataFrame for the chart
    'filename': "ev_racing_bar.mp4",           # Output file name
    'orientation': "h",                         # Horizontal bar plot
    'sort': "desc",                             # Sort bars in descending order
    'n_bars': 10,                               # Display the top 10 bars
    'steps_per_period': 50,                     # Steps for smoothness
    'period_length': 2000,                      # Duration of each period in milliseconds
    'title': 'EV Make Count Over the Years' # Title settings
}

# Create the racing bar plot and save it as a video
bcr.bar_chart_race(**bar_chart_params)

# Confirmation message
print("Video saved as ev_racing_bar.mp4")
C:\Users\vijay\anaconda3\Lib\site-packages\bar_chart_race\_make_chart.py:889: FutureWarning:

Series.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.

C:\Users\vijay\anaconda3\Lib\site-packages\bar_chart_race\_make_chart.py:286: UserWarning:

set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.

C:\Users\vijay\anaconda3\Lib\site-packages\bar_chart_race\_make_chart.py:287: UserWarning:

set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.

C:\Users\vijay\anaconda3\Lib\site-packages\bar_chart_race\_make_chart.py:226: UserWarning:

Some of your columns never make an appearance in the animation. To reduce color repetition, set `filter_column_colors` to `True`

Video saved as ev_racing_bar.mp4
In [ ]: